
restart workers after executing 10 tasks to mitigate memory leaks #840

Merged (5 commits) on Jul 23, 2024

Conversation

@Adam-D-Lewis (Contributor) commented Jul 2, 2024

Fixes #nebari-dev/nebari#2418

Description

This pull request:

  • restarts Celery workers after they execute 10 tasks (docs here). This mitigates the memory leaks that are occurring; a sketch of the underlying Celery setting is shown below.
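For reference, a minimal sketch of the Celery setting this change relies on; the app name and broker URL below are illustrative placeholders, not conda-store's actual configuration:

```python
from celery import Celery

# Illustrative app and broker; conda-store wires up its own Celery app.
app = Celery("tasks", broker="redis://localhost:6379/0")

# Replace each worker child process with a fresh one after it has
# executed 10 tasks, so any memory leaked by those tasks is reclaimed
# when the old process exits.
app.conf.worker_max_tasks_per_child = 10
```

The same behavior can also be requested on the command line with celery worker's --max-tasks-per-child flag.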

Pull request checklist

  • Did you test this change locally?
  • Did you update the documentation (if required)?
  • Did you add/update relevant tests for this change (if required)?

Additional information

How to test

@trallard edited from: #840 (comment)

To test, you can check out this branch and run conda-store standalone (https://conda.store/conda-store/how-tos/install-standalone), or run it in Docker containers (docker-compose up --build -d, as explained here). After startup, visit localhost:8080/conda-store/admin and rebuild the environment it ships with a few times. If you watch the Docker memory usage (docker stats), you'll see that memory does not continue to climb with each new build as it does without this option set.
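If a rough, repeatable measurement is helpful, here is a small sketch that samples the worker container's memory with docker stats; the container name is an assumption and may differ in your compose setup:

```python
import subprocess
import time

# Hypothetical container name; check `docker ps` for the actual name
# of the conda-store worker container in your compose setup.
CONTAINER = "conda-store-worker"


def sample_memory(samples: int = 20, interval_seconds: int = 30) -> None:
    """Print the container's memory usage every `interval_seconds`."""
    for _ in range(samples):
        result = subprocess.run(
            [
                "docker", "stats", "--no-stream",
                "--format", "{{.Name}}\t{{.MemUsage}}",
                CONTAINER,
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        print(result.stdout.strip())
        time.sleep(interval_seconds)


if __name__ == "__main__":
    sample_memory()
```

Triggering a few rebuilds while this runs should show roughly flat memory with this change and steadily climbing memory without it.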

netlify bot commented Jul 2, 2024

Deploy Preview for conda-store ready!

Latest commit: d0c6085
Latest deploy log: https://app.netlify.com/sites/conda-store/deploys/66a01029aea96c0009c5fb96
Deploy Preview: https://deploy-preview-840--conda-store.netlify.app

@Adam-D-Lewis (Contributor, Author) commented Jul 2, 2024

@dcmcand @peytondmurray @kcpevey @trallard Could I get a review on this?

@Adam-D-Lewis (Contributor, Author)

I'm happy to make this configurable via a traitlet as well if desired, but the worker restarts seem to happen quickly, so I can't see much downside to just applying it to all conda-store deployments.

@kcpevey (Contributor) commented Jul 3, 2024

I can't speak to the technical aspect of this, but I'm very glad to see it. I see the memory leakage on all my deployments. As far as I can tell, it leads to builds failing erroneously due to OOM, generally with errors that make no sense. All around, this is a big improvement for admin management and end-user UX.

@peytondmurray (Contributor) left a review comment

Huh, with this option it looks like celery will just automatically restart the worker after 10 tasks are completed. My one complaint with this is it doesn't strictly speaking fix nebari-dev/nebari#2418 as we are only treating a symptom not the memory usage itself.

But in any case from what I have read it seems like this is exactly the kind of use case the option was intended for, so I guess let's just use it. Is there an easy way to test this? If so it would be nice to ensure that it works.

@peytondmurray added labels on Jul 3, 2024: status: in progress 🏗, impact: high 🟥 (This issue affects most of the conda-store users or is a critical issue), area: user experience 👩🏻‍💻 (Items impacting the end-user experience), type: maintenance 🛠, needs: testing ✅
@Adam-D-Lewis (Contributor, Author) commented Jul 4, 2024

Huh, with this option it looks like celery will just automatically restart the worker after 10 tasks are completed. My one complaint with this is it doesn't strictly speaking fix nebari-dev/nebari#2418 as we are only treating a symptom not the memory usage itself.

I agree it does just treat the symptom without fixing the underlying issue. That said, I think this will work robustly.

But in any case from what I have read it seems like this is exactly the kind of use case the option was intended for, so I guess let's just use it. Is there an easy way to test this? If so it would be nice to ensure that it works.

To test, you can check out this branch and run conda-store standalone (https://conda.store/conda-store/how-tos/install-standalone), or run it in Docker containers (docker-compose up --build -d, as explained here). After startup, visit localhost:8080/conda-store/admin and rebuild the environment it ships with a few times. If you watch the Docker memory usage (docker stats), you'll see that memory does not continue to climb with each new build as it does without this option set.

@peytondmurray (Contributor)

To test, you can check out this branch and run conda-store standalone (https://conda.store/conda-store/how-tos/install-standalone), or run it in Docker containers (docker-compose up --build -d, as explained here). After startup, visit localhost:8080/conda-store/admin and rebuild the environment it ships with a few times. If you watch the Docker memory usage (docker stats), you'll see that memory does not continue to climb with each new build as it does without this option set.

Ahh, what I meant was is there a simple way of adding automated testing?

@trallard (Collaborator) commented Jul 4, 2024

Ahh, what I meant was is there a simple way of adding automated testing?

I am +1 with Peyton here. It is good that we can manually verify (once or so), but I would prefer having some sort of automated tests (akin to load tests) to ensure this is, in fact, a useful fix.
Those tests could later be used to also do some benchmarking/memory checks.
So while the fix itself seems fine, I would suggest holding off on merging until the tests have been added.

@Adam-D-Lewis (Contributor, Author) commented Jul 9, 2024

I've followed up on where the memory leak is occurring using memray. I'm new to memray, but the flamegraph seems to point to line 20 of list_conda_prefix_packages in /opt/conda-store-server/conda_store_server/_internal/action/add_conda_prefix_packages.py, which is the prefix_data.load() call here. You can see more of the call stack in the screenshot. I believe what memray is telling us is that the data loaded on that line stays in memory for some reason. It's not that this particular line causes the leak, but that it is the object that is sticking around.

I'm not sure what that code is doing or why it would cause a memory leak, but I'm recording it here for now. I'll try to find out more later, or work out a more targeted fix.

[Screenshot: memray flamegraph showing the allocation call stack]
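For context, a minimal sketch of how a flamegraph like this can be captured with memray's Tracker API; the wrapped function below is a stand-in for the build code, not the exact call that was profiled:

```python
from memray import Tracker


def suspect_operation() -> None:
    # Stand-in for the code under investigation, e.g. the build step
    # that ends up calling list_conda_prefix_packages().
    ...


# Record every allocation made while the block runs, then render the
# report with: memray flamegraph memray-profile.bin
with Tracker("memray-profile.bin"):
    suspect_operation()
```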

@peytondmurray (Contributor)

Thanks, this is actually enough to start narrowing this down to something we can properly troubleshoot. I still think it would be worth merging this feature, as there doesn't seem to be a real cost to it. I'll open another issue with the goal of actually eliminating the memory leak.

@trallard (Collaborator) commented Jul 9, 2024

While I can see why add_conda_prefix_packages.py could lead to memory growth (and potential issues if the json.load() and its in-memory representations are not handled accordingly), I cannot see a definite indication that this is indeed the culprit.
add_conda_prefix_packages.py definitely spends a lot of time in CPU, but that does not necessarily mean it is leading to memory leaks or bloat.

So we'd definitely need to do more profiling and memory analysis over time.
@Adam-D-Lewis, can you share a minimal reproducible example of how you got to the flame graph you shared?

As for merging, since this PR has been marked as needing tests, I think we should still hold off until we have a test in place.

@Adam-D-Lewis (Contributor, Author) commented Jul 9, 2024

I plan to write up what I did to generate the memray flamegraph soon.

As far as a test showing the leak goes, what I'm planning on doing is running a test which builds many conda envs and shows that the memory used by the conda-store worker Docker container continues to rise with each build.

We could maybe use the limit_memory decorator in pytest-memray to make the test fail if it uses too much memory, but this would only be possible if we use the pytest-celery setup, which I was having trouble with, so I prefer the Docker tests at the moment.
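For illustration, a sketch of what a pytest-memray guard could look like; the build_environment fixture and the memory threshold are assumptions, not part of an existing conda-store test suite:

```python
import pytest


@pytest.mark.limit_memory("200 MB")  # threshold is an arbitrary placeholder
def test_repeated_builds_do_not_accumulate_memory(build_environment):
    # `build_environment` is a hypothetical fixture that triggers a single
    # conda environment build in-process; a Docker-based test would instead
    # watch the worker container's memory across builds.
    for _ in range(10):
        build_environment()
```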

@Adam-D-Lewis Adam-D-Lewis marked this pull request as draft July 9, 2024 19:37
@Adam-D-Lewis (Contributor, Author)

@trallard More info about the flamegraph generation in the new issue Peyton opened - #848 (comment)

@peytondmurray added the needs: discussion 💬 (This item needs team-level discussion before scoping) label on Jul 23, 2024
@Adam-D-Lewis Adam-D-Lewis marked this pull request as ready for review July 23, 2024 20:18
@peytondmurray (Contributor) left a review comment

After a discussion with @Adam-D-Lewis and @trallard, it seems like this is potentially only observable in Docker. For now we will merge this without a test, as it is low risk, and come back to testing memory consumption later once we know more.

@peytondmurray peytondmurray merged commit 1df30a2 into conda-incubator:main Jul 23, 2024
26 of 27 checks passed